CI: system tests: enable parallel tests #23989

edsantiago · 2024-09-17T17:25:51Z

Well, here goes. Let's try enabling parallel system tests.

There's still a ton of work left to do, but I think we're ready enough. The remaining work is in #23275, and I will continue submitting piecemeal PRs over time.

We're gonna see flakes. Sorry. The remaining flakes are really hard to track down. OTOH rerun time will be shorter.

Note to reviewers: the only way to handle this is by looking at individual commits.

None

For the past two months we've been splitting system tests into two categories: those that CAN be run in parallel, and those that CANNOT. Much work has been done to replace hardcoded names (mycontainer, mypod) with safename(). Hundreds of test runs, in CI and on Ed's laptop, have proven this approach viable. make {local,remote}system now runs in two steps: first the serial ones, then the parallel ones. hack/bats will now recognize the 'ci:parallel' tag and add --jobs (nprocs). This requires some tweaking of leak_check, because there can be umpteen tests running (affecting image/container/pod/etc state) when any given test completes. Rules for enabling parallelization in tests: * use unique container/pod/volume/network names (safename) * do not run 'podman rm -a' or 'rmi -a' * never use the -l (--latest) option * do not run 'podman ps/images' and expect precise output Signed-off-by: Ed Santiago <[email protected]>

Workaround for containers#23292, where simultaneous 'pod create' commands will all start a podman-build of the pause image, but only one of them will be tagged, and the others will leak <none> images. Signed-off-by: Ed Santiago <[email protected]>

Signed-off-by: Ed Santiago <[email protected]>

For tests run in parallel, show file number as |nnn| (vs [nnn]) Teach logformatter to distinguish the two, adding 'p' to anchors in parallel tests. Necessary because in this scheme we run bats twice, thus see 'ok 1' twice, and we want to differentiate them. Signed-off-by: Ed Santiago <[email protected]>

Add a few best-practices examples, and add a whole section describing the dos and donts of writing parallel-safe tests. Signed-off-by: Ed Santiago <[email protected]>

When running parallel, multiple tests could be trying to start the registry at once. Make this parallel-safe. Also, use a safer port range for the registry. Something outside of /proc/sys/net/ipv4/ip_local_port_range Sorry, I'm including a FIXME section that I haven't investigated deeply enough. Signed-off-by: Ed Santiago <[email protected]>

Need --layers=false in podman build, otherwise a buildah race can trigger "layer not known" failures: containers/buildah#5674 Signed-off-by: Ed Santiago <[email protected]>

Signed-off-by: Ed Santiago <[email protected]>

...for dealing with flakes in parallel mode Signed-off-by: Ed Santiago <[email protected]>

openshift-ci · 2024-09-17T17:25:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [edsantiago]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mheon · 2024-09-17T18:01:37Z

LGTM on my end. Let's bite the bullet and get it enabled.

rhatdan · 2024-09-17T19:04:39Z

/lgtm

Luap99

Would have been great if I had gotten a chance to review this first before merging so quickly. Anyhow no major blocker so it is fine.

Luap99 · 2024-09-18T08:54:11Z

.cirrus.yml

-    gce_instance: *standardvm
-    timeout_in: 35m
+    gce_instance: *fastvm
+    timeout_in: 25m


slowest test in this PR was 23:23 so this looks very close to hitting this already, maybe it was as particular bad run but we should not set it to close IMO
I guess it is fine as you keep converting more files to run parallel so we speed up over the next days/weeks?

Luap99 · 2024-09-18T08:59:12Z

test/system/000-TEMPLATE

+#
+# This is an example of what NOT to do when enabling parallel tests.
+#
+# bats test_tags=ci:parallel


This should likely mention the possibility of a file_tag for the whole file? i.e. do not set it if the file already has it set and a author must check if the whole file is parallel

Also examples of what not to do are bad if they do not show what to do instead. This should be in the style of

do not do
run_podman --name mycontainer $IMAGE top
instead do
local cname="c-$(safename)"
run_podman --name $cname $IMAGE top

etc...

test/system/252-quadlet.bats

test/system/helpers.registry.bash

test/system/setup_suite.bash

Luap99 · 2024-09-18T09:19:46Z

test/system/README.md

@@ -5,11 +5,10 @@ debug failures.
 Quick Start
 ===========

-Look at [030-run.bats](030-run.bats) for a simple but packed example.
+Look at [000-TEMPLATE](000-TEMPLATE) for a simple starting point.


There are no docs here in the Readme about parallel tests requirements and I assume many will not look into the template file. I think the parallel testing should have its own section here

edsantiago · 2024-09-18T11:26:00Z

I was expecting a week of back-and-forth in review. I tried to be diligent about cleaning up my test-only comments and code, but missed some. I will clean those up today. Thank you.

(Documentation will be lower priority. I still have a few more tests to submit)

edsantiago added 9 commits September 17, 2024 09:25

clean_setup: create pause image

bcffa9c

Workaround for containers#23292, where simultaneous 'pod create' commands will all start a podman-build of the pause image, but only one of them will be tagged, and the others will leak <none> images. Signed-off-by: Ed Santiago <[email protected]>

ci: bump system tests to fastvm

6b621d9

Signed-off-by: Ed Santiago <[email protected]>

Update system test template and README

bf61317

Add a few best-practices examples, and add a whole section describing the dos and donts of writing parallel-safe tests. Signed-off-by: Ed Santiago <[email protected]>

Add workaround for buildah parallel bug

b3da5be

Need --layers=false in podman build, otherwise a buildah race can trigger "layer not known" failures: containers/buildah#5674 Signed-off-by: Ed Santiago <[email protected]>

Add network namespace leak check

7fcf94d

Signed-off-by: Ed Santiago <[email protected]>

Misc minor test fixes

8402b65

...for dealing with flakes in parallel mode Signed-off-by: Ed Santiago <[email protected]>

openshift-ci bot added the release-note-none label Sep 17, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2024

openshift-ci bot assigned rhatdan Sep 17, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 17, 2024

openshift-merge-bot bot merged commit 4dfff40 into containers:main Sep 17, 2024
88 checks passed

edsantiago deleted the enable-bats-parallel branch September 18, 2024 00:07

Luap99 reviewed Sep 18, 2024

View reviewed changes

edsantiago mentioned this pull request Oct 29, 2024

CI: parallel sys tests: teardown: registry socket still seems open #23677

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: system tests: enable parallel tests #23989

CI: system tests: enable parallel tests #23989

edsantiago commented Sep 17, 2024

openshift-ci bot commented Sep 17, 2024

mheon commented Sep 17, 2024

rhatdan commented Sep 17, 2024

Luap99 left a comment

Luap99 Sep 18, 2024

Luap99 Sep 18, 2024

Luap99 Sep 18, 2024

edsantiago commented Sep 18, 2024

CI: system tests: enable parallel tests #23989

CI: system tests: enable parallel tests #23989

Conversation

edsantiago commented Sep 17, 2024

openshift-ci bot commented Sep 17, 2024

mheon commented Sep 17, 2024

rhatdan commented Sep 17, 2024

Luap99 left a comment

Choose a reason for hiding this comment

Luap99 Sep 18, 2024

Choose a reason for hiding this comment

Luap99 Sep 18, 2024

Choose a reason for hiding this comment

Luap99 Sep 18, 2024

Choose a reason for hiding this comment

edsantiago commented Sep 18, 2024